training speed
Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
Recently, various strategies for distributed training of large language models (LLMs) have been proposed. By categorizing them into basic strategies and composite strategies, we have discovered that existing basic strategies provide limited options in specific scenarios, leaving considerable room for optimization in training speed. In this paper, we rethink the impact of memory and communication costs on the training speed of LLMs, taking into account the impact of intra- and inter-group communication performance disparities, and then propose a new set of basic strategies named the Partial Redundancy Optimizer (PaRO). PaRO Data Parallelism (PaRO-DP) accelerates LLM training through refined model state partitioning and tailored training procedures.
MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge
Recently, a new trend of exploring sparsity for accelerating neural network training has emerged, embracing the paradigm of training on the edge. This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices. The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S) that ensure superior accuracy at high sparsity ratios. Unlike existing works on sparse training, this work reveals the importance of sparsity schemes for the performance of sparse training in terms of accuracy as well as training speed on real edge devices. On top of that, the paper proposes to employ data efficiency for further acceleration of sparse training.
A Bayesian Perspective on Training Speed and Model Selection
We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, that a measure of a model's training speed can be used to estimate its marginal likelihood. Second, that this measure, under certain conditions, predicts the relative weighting of models in linear model combinations trained to minimize a regression loss. We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks. We further provide encouraging empirical evidence that the intuition developed in these settings also holds for deep neural networks trained with stochastic gradient descent. Our results suggest a promising new direction towards explaining why neural networks trained with stochastic gradient descent are biased towards functions that generalize well.
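The link between training speed and the marginal likelihood can be illustrated with the chain rule of probability: the log marginal likelihood equals the sum of log posterior predictive probabilities of each data point given the points before it, so a model that predicts each new point well after few updates accumulates higher evidence. The following is a minimal sketch for Bayesian linear regression, not the authors' code. It assumes a zero-mean isotropic Gaussian prior and Gaussian noise, and the function names and the `prior_var` and `noise_var` parameters are hypothetical choices made for illustration.

```python
import numpy as np


def log_evidence_closed_form(X, y, prior_var, noise_var):
    """Log marginal likelihood of y under w ~ N(0, prior_var * I).

    Marginally, y ~ N(0, prior_var * X X^T + noise_var * I).
    """
    n = X.shape[0]
    cov = prior_var * X @ X.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)


def log_evidence_sequential(X, y, prior_var, noise_var):
    """Same quantity, accumulated one data point at a time.

    Each term is the log posterior predictive density of y_i given
    all earlier points, followed by an exact Bayesian update.
    """
    d = X.shape[1]
    mean = np.zeros(d)            # current posterior mean of the weights
    cov = prior_var * np.eye(d)   # current posterior covariance
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred_mean = x_i @ mean
        pred_var = x_i @ cov @ x_i + noise_var
        resid = y_i - pred_mean
        total += -0.5 * (np.log(2 * np.pi * pred_var) + resid**2 / pred_var)
        # Rank-one posterior update after observing (x_i, y_i).
        gain = cov @ x_i / pred_var
        mean = mean + gain * resid
        cov = cov - np.outer(gain, x_i @ cov)
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
    print(log_evidence_closed_form(X, y, 1.0, 0.01))
    print(log_evidence_sequential(X, y, 1.0, 0.01))
```

The two functions should agree up to floating-point error; the sequential form makes explicit that the evidence is a running total of how well the model predicts each point before seeing it.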
Speedy Performance Estimation for Neural Architecture Search
Reliable yet efficient evaluation of generalisation performance of a proposed architecture is crucial to the success of neural architecture search (NAS). Traditional approaches face a variety of limitations: training each architecture to completion is prohibitively expensive, early stopped validation accuracy may correlate poorly with fully trained performance, and model-based estimators require large training sets. We instead propose to estimate the final test performance based on a simple measure of training speed. Our estimator is theoretically motivated by the connection between generalisation and training speed, and is also inspired by the reformulation of a PAC-Bayes bound under the Bayesian setting. Our model-free estimator is simple, efficient, and cheap to implement, and does not require hyperparameter-tuning or surrogate training before deployment. We demonstrate on various NAS search spaces that our estimator consistently outperforms other alternatives in achieving better correlation with the true test performance rankings. We further show that our estimator can be easily incorporated into both query-based and one-shot NAS methods to improve the speed or quality of the search.
on both our theoretical contributions showing an equivalence between a notion of training speed and the Bayesian
We thank the reviewers for their helpful feedback. We now address some concerns. We have replicated the DNN experiments (S4.2) We can derive this result using Jensen's 'I was not able to ascertain how the result of Theorem 2 is used in the text, I'd be happy if the authors could clarify.' 'I found that the transition to the neural networks remains a bit confusing.' 'how much the results support the marginal likelihood-based model selection hypothesis, or whether they should more
Appendices: Contextually Affinitive Neighborhood Refinery for Deep Clustering A More Experimental Results A.1 Training Efficiency
We show the training efficiency of ConNR by comparing its training speed with a standard efficient SSL baseline, BYOL. ConAff neighborhood can be injected into the group-aware concordance loss. Tiny-ImageNet consists of 200 classes with 100,000 training images in total. The results in Table 5 indicate that our approach can successfully scale to large datasets. These outcomes demonstrate the effectiveness and scalability of our proposed method when applied to Tiny-ImageNet.